Computational discovery of human coding and non-coding transcripts with conserved splice sites
Abstract
MOTIVATION: Long non-coding RNAs (lncRNAs) resemble protein-coding mRNAs but do not encode proteins. Most lncRNAs are under lower sequence constraints than protein-coding genes and lack conserved secondary structures, making it hard to predict them computationally.

RESULTS: We introduce an approach to predict spliced lncRNAs in vertebrate genomes combining comparative genomics and machine learning. It is based on detecting signatures of characteristic splice site evolution in vertebrate whole genome alignments. First, we predict individual splice sites, then assemble compatible sites into exon candidates, and finally predict multi-exon transcripts. Using a novel method to evaluate typical splice site substitution patterns that explicitly takes the species phylogeny into account, we show that individual splice sites can be accurately predicted. Since our approach relies only on predicted splice sites, it can uncover both coding and non-coding exons. We show that our predicted exons and partial transcripts are mostly non-coding and lack conserved secondary structures. These exons are of particular interest, since existing computational approaches cannot detect them. Transcriptome sequencing data indicate tissue-specific expression patterns of predicted exons and there is evidence that increasing sequencing depth and breadth will validate additional predictions. We also found a significant enrichment of predicted exons that form multi-exon transcript parts, and we experimentally validate such a novel multi-exon gene. Overall, we obtain 336 novel multi-exon transcript predictions from human intergenic regions. Our results indicate the existence of novel human transcripts that are conserved in evolution and our approach contributes to the completion of the human transcript catalog.

AVAILABILITY AND IMPLEMENTATION: Predicted human splice sites, exons and gene structures together with a Perl implementation of the tree-based log-odds scoring and a supplementary PDF file containing additional figures and tables are available at: http://www.bioinf.uni-leipzig.de/publications/supplements/10-010. The five experimentally confirmed partial transcript isoforms have been deposited in GenBank under accession numbers HM587422-HM587426.
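
The three-step procedure named in the abstract (predict splice sites, assemble compatible sites into exon candidates, then predict multi-exon transcripts) can be illustrated with a toy sketch. This is not the authors' method: the abstract does not specify the compatibility rules, and the actual site scoring is the tree-based log-odds scheme distributed in Perl. Everything below is an assumption made for illustration, including the names `SpliceSite`, `Exon`, `exon_candidates` and `best_chain`, the single-strand coordinate model, the length thresholds, and the use of summed site scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpliceSite:
    pos: int      # coordinate on one strand (hypothetical toy model)
    kind: str     # "acceptor" (exon start) or "donor" (exon end)
    score: float  # stand-in for a per-site prediction score


@dataclass(frozen=True)
class Exon:
    start: int    # acceptor position
    end: int      # donor position
    score: float  # assumed here to be the sum of both site scores


def exon_candidates(sites: List[SpliceSite],
                    min_len: int = 30,
                    max_len: int = 300) -> List[Exon]:
    """Pair every acceptor with each downstream donor within a length window."""
    acceptors = [s for s in sites if s.kind == "acceptor"]
    donors = [s for s in sites if s.kind == "donor"]
    exons = [
        Exon(a.pos, d.pos, a.score + d.score)
        for a in acceptors
        for d in donors
        if min_len <= d.pos - a.pos <= max_len
    ]
    return sorted(exons, key=lambda e: (e.start, e.end))


def best_chain(exons: List[Exon], min_intron: int = 50) -> List[Exon]:
    """Highest-scoring chain of exons separated by at least min_intron bases."""
    if not exons:
        return []
    exons = sorted(exons, key=lambda e: e.end)
    best = [e.score for e in exons]   # best chain score ending at exon i
    prev = [-1] * len(exons)          # back-pointer to the previous exon
    for i, e in enumerate(exons):
        for j in range(i):
            candidate = best[j] + e.score
            if e.start - exons[j].end >= min_intron and candidate > best[i]:
                best[i] = candidate
                prev[i] = j
    # Trace back from the best-scoring chain end.
    i = max(range(len(exons)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(exons[i])
        i = prev[i]
    return chain[::-1]


# Made-up sites, purely for demonstration.
sites = [
    SpliceSite(100, "acceptor", 1.0),
    SpliceSite(200, "donor", 2.0),
    SpliceSite(400, "acceptor", 1.5),
    SpliceSite(520, "donor", 0.5),
    SpliceSite(1000, "donor", 3.0),
]
exons = exon_candidates(sites)
transcript = best_chain(exons)
```

With these made-up inputs, the donor at position 1000 is left unpaired because no acceptor lies within the assumed length window, and the two remaining exon candidates are chained into one two-exon transcript part. A realistic version would also need to handle both strands and genome alignments, which this sketch omits.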
Journal:
- Bioinformatics
Volume 27, Issue 14
Pages: -
Publication date: 2011